Marks: 60
The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: they can help fight inflation, build wealth, and provide certain tax benefits. Steady returns compounded over a long period can grow far more than intuition suggests, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can accumulate for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximize earnings under any market condition. A diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics used to assess a stock's worth, and repeating that analysis across a multitude of stocks to identify the right picks for an individual is a tedious task. Cluster analysis can identify stocks that exhibit similar characteristics as well as stocks that exhibit minimal correlation with one another. This helps investors analyze stocks across different market segments and guard against risks that could leave the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. The firm has hired you as a Data Scientist and provided data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.
#Libraries to help with reading data and manipulation
import numpy as np
import pandas as pd
#Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
#to scale the data using z score
from sklearn.preprocessing import StandardScaler
#to compute distances
from scipy.spatial.distance import cdist
from scipy.spatial.distance import pdist
#to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
#to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
#to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
#to perform pca
from sklearn.decomposition import PCA
#to suppress warnings
import warnings
warnings.filterwarnings("ignore")
data=pd.read_csv("stock_data.csv")
data.shape
(340, 15)
#viewing a random sample of the dataset
data.sample(n=10, random_state=1)
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102 | DVN | Devon Energy Corp. | Energy | Oil & Gas Exploration & Production | 32.000000 | -15.478079 | 2.923698 | 205 | 70 | 830000000 | -14454000000 | -35.55 | 4.065823e+08 | 93.089287 | 1.785616 |
| 125 | FB | Facebook | Information Technology | Internet Software & Services | 104.660004 | 16.224320 | 1.320606 | 8 | 958 | 592000000 | 3669000000 | 1.31 | 2.800763e+09 | 79.893133 | 5.884467 |
| 11 | AIV | Apartment Investment & Mgmt | Real Estate | REITs | 40.029999 | 7.578608 | 1.163334 | 15 | 47 | 21818000 | 248710000 | 1.52 | 1.636250e+08 | 26.335526 | -1.269332 |
| 248 | PG | Procter & Gamble | Consumer Staples | Personal Products | 79.410004 | 10.660538 | 0.806056 | 17 | 129 | 160383000 | 636056000 | 3.28 | 4.913916e+08 | 24.070121 | -2.256747 |
| 238 | OXY | Occidental Petroleum | Energy | Oil & Gas Exploration & Production | 67.610001 | 0.865287 | 1.589520 | 32 | 64 | -588000000 | -7829000000 | -10.23 | 7.652981e+08 | 93.089287 | 3.345102 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 4.353535e+08 | 17.682214 | -3.838260 |
| 112 | EQT | EQT Corporation | Energy | Oil & Gas Exploration & Production | 52.130001 | -21.253771 | 2.364883 | 2 | 201 | 523803000 | 85171000 | 0.56 | 1.520911e+08 | 93.089287 | 9.567952 |
| 147 | HAL | Halliburton Co. | Energy | Oil & Gas Equipment & Services | 34.040001 | -5.101751 | 1.966062 | 4 | 189 | 7786000000 | -671000000 | -0.79 | 8.493671e+08 | 93.089287 | 17.345857 |
| 89 | DFS | Discover Financial Services | Financials | Consumer Finance | 53.619999 | 3.653584 | 1.159897 | 20 | 99 | 2288000000 | 2297000000 | 5.14 | 4.468872e+08 | 10.431906 | -0.375934 |
| 173 | IVZ | Invesco Ltd. | Financials | Asset Management & Custody Banks | 33.480000 | 7.067477 | 1.580839 | 12 | 67 | 412000000 | 968100000 | 2.26 | 4.283628e+08 | 14.814159 | 4.218620 |
df=data.copy()
df.columns=[c.replace(" ","_") for c in df.columns]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Ticker_Symbol                 340 non-null    object
 1   Security                      340 non-null    object
 2   GICS_Sector                   340 non-null    object
 3   GICS_Sub_Industry             340 non-null    object
 4   Current_Price                 340 non-null    float64
 5   Price_Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64
 8   Cash_Ratio                    340 non-null    int64
 9   Net_Cash_Flow                 340 non-null    int64
 10  Net_Income                    340 non-null    int64
 11  Earnings_Per_Share            340 non-null    float64
 12  Estimated_Shares_Outstanding  340 non-null    float64
 13  P/E_Ratio                     340 non-null    float64
 14  P/B_Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
df.drop("Ticker_Symbol", axis=1, inplace=True)
df.duplicated().sum()
0
There are no duplicate rows in the data.
df.describe()
| | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 340.000000 | 340.000000 | 340.000000 | 340.000000 | 340.000000 | 3.400000e+02 | 3.400000e+02 | 340.000000 | 3.400000e+02 | 340.000000 | 340.000000 |
| mean | 80.862345 | 4.078194 | 1.525976 | 39.597059 | 70.023529 | 5.553762e+07 | 1.494385e+09 | 2.776662 | 5.770283e+08 | 32.612563 | -1.718249 |
| std | 98.055086 | 12.006338 | 0.591798 | 96.547538 | 90.421331 | 1.946365e+09 | 3.940150e+09 | 6.587779 | 8.458496e+08 | 44.348731 | 13.966912 |
| min | 4.500000 | -47.129693 | 0.733163 | 1.000000 | 0.000000 | -1.120800e+10 | -2.352800e+10 | -61.200000 | 2.767216e+07 | 2.935451 | -76.119077 |
| 25% | 38.555000 | -0.939484 | 1.134878 | 9.750000 | 18.000000 | -1.939065e+08 | 3.523012e+08 | 1.557500 | 1.588482e+08 | 15.044653 | -4.352056 |
| 50% | 59.705000 | 4.819505 | 1.385593 | 15.000000 | 47.000000 | 2.098000e+06 | 7.073360e+08 | 2.895000 | 3.096751e+08 | 20.819876 | -1.067170 |
| 75% | 92.880001 | 10.695493 | 1.695549 | 27.000000 | 99.000000 | 1.698108e+08 | 1.899000e+09 | 4.620000 | 5.731175e+08 | 31.764755 | 3.917066 |
| max | 1274.949951 | 55.051683 | 4.580042 | 917.000000 | 958.000000 | 2.076400e+10 | 2.444200e+10 | 50.090000 | 6.159292e+09 | 528.039074 | 129.064585 |
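The summary statistics show heavy right tails (e.g. a Current_Price maximum of 1274.95 against a median of 59.71), so it is worth counting how many observations fall outside the usual 1.5×IQR whiskers before clustering. A minimal sketch, using a hypothetical two-column frame in place of the real `df[num_col]`:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the numeric columns of df;
# the same logic applies unchanged to the full dataset.
frame = pd.DataFrame({
    "Current_Price": [4.5, 38.6, 59.7, 92.9, 1274.9],
    "P/E_Ratio": [2.9, 15.0, 20.8, 31.8, 528.0],
})

def iqr_outlier_counts(frame):
    """Count values outside the 1.5*IQR whiskers for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((frame < lower) | (frame > upper)).sum()

print(iqr_outlier_counts(frame))
```

Since the extreme values here look like genuine properties of the stocks rather than data errors, they are typically kept for clustering rather than dropped.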
Questions:
#function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # for histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
#selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()
for item in num_col:
    histogram_boxplot(df, item)
labeled_barplot(df, "GICS_Sector")
plt.figure(figsize=(15,7))
sns.heatmap(df[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
sns.pairplot(data=df[num_col], diag_kind="kde")
plt.show()
df.isna().sum()
Security                        0
GICS_Sector                     0
GICS_Sub_Industry               0
Current_Price                   0
Price_Change                    0
Volatility                      0
ROE                             0
Cash_Ratio                      0
Net_Cash_Flow                   0
Net_Income                      0
Earnings_Per_Share              0
Estimated_Shares_Outstanding    0
P/E_Ratio                       0
P/B_Ratio                       0
dtype: int64
#variables used for clustering
num_col
['Current_Price', 'Price_Change', 'Volatility', 'ROE', 'Cash_Ratio', 'Net_Cash_Flow', 'Net_Income', 'Earnings_Per_Share', 'Estimated_Shares_Outstanding', 'P/E_Ratio', 'P/B_Ratio']
#scaling the dataset before clustering
scaler=StandardScaler()
subset=df[num_col].copy()
subset_scaled=scaler.fit_transform(subset)
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
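A quick sanity check after scaling: each z-scored column should have a mean of approximately 0 and a (population) standard deviation of approximately 1, since `StandardScaler` standardizes with `ddof=0`. A sketch on a toy frame standing in for `df[num_col]` (the column names here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for df[num_col].
subset = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
scaled = pd.DataFrame(StandardScaler().fit_transform(subset), columns=subset.columns)

# After z-scoring, each column should have mean ~0 and population std ~1.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```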
clusters = range(1, 9)
meanDistortions = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(subset_scaled_df)
    prediction = model.predict(subset_scaled_df)
    distortion = (
        sum(
            np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
        )
        / subset_scaled_df.shape[0]
    )
    meanDistortions.append(distortion)
    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.382318498894466
Number of Clusters: 3 	Average Distortion: 2.2692367155390745
Number of Clusters: 4 	Average Distortion: 2.178151429073042
Number of Clusters: 5 	Average Distortion: 2.110720186207485
Number of Clusters: 6 	Average Distortion: 2.062297686937201
Number of Clusters: 7 	Average Distortion: 2.0289794220177395
Number of Clusters: 8 	Average Distortion: 1.984517747325883
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
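The distortion loop above computes distances with `cdist` by hand; `KMeans` also exposes the within-cluster sum of squares directly as `inertia_`, which produces the same elbow shape with less code. A sketch on synthetic blob data (the real call would pass `subset_scaled_df` instead of `X`):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic stand-in for subset_scaled_df: three well-separated blobs.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (-5, 0, 5)])

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Inertia decreases with k; the "elbow" is where the drop flattens out.
print([round(i, 1) for i in inertias])
```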
sil_score = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(subset_scaled_df)
    # centers = clusterer.cluster_centers_
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n clusters = {}, silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
For n clusters = 2, silhouette score is 0.43969639509980457
For n clusters = 3, silhouette score is 0.4644405674779404
For n clusters = 4, silhouette score is 0.4577225970476733
For n clusters = 5, silhouette score is 0.42436843176418354
For n clusters = 6, silhouette score is 0.3863465606304045
For n clusters = 7, silhouette score is 0.42199780823099103
For n clusters = 8, silhouette score is 0.1430038367496768
For n clusters = 9, silhouette score is 0.40205424663377165
[<matplotlib.lines.Line2D at 0x2980f56e320>]
# Finding optimal no.of clusters with silhouette coefficients
visualizer=SilhouetteVisualizer(KMeans(7,random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 7 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer=SilhouetteVisualizer(KMeans(6,random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
visualizer=SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no of clusters with silhouette coefficients
visualizer=SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
#finding optimal no of clusters with silhouette coefficents
visualizer=SilhouetteVisualizer(KMeans(3,random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
kmeans=KMeans(n_clusters=5, random_state=0)
kmeans.fit(subset_scaled_df)
KMeans(n_clusters=5, random_state=0)
#adding K-means cluster labels to the original dataframe
df["K_means_segments"] = kmeans.labels_
cluster_profile = df.groupby("K_means_segments").mean(numeric_only=True)  # numeric_only skips the text columns
cluster_profile["count_in_each_segment"] = (
    df.groupby("K_means_segments")["Earnings_Per_Share"].count().values
)
#let's display the cluster profiles
cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | count_in_each_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments | ||||||||||||
| 0 | 246.574304 | 14.284326 | 1.769621 | 26.500000 | 279.916667 | 459120250.000000 | 1009205541.666667 | 6.167917 | 549432140.538333 | 90.097512 | 14.081386 | 24 |
| 1 | 41.373681 | -14.849938 | 2.596790 | 27.285714 | 64.457143 | 34462657.142857 | -1293864285.714286 | -2.459714 | 450100420.905143 | 61.563930 | 2.476202 | 35 |
| 2 | 48.103077 | 6.053507 | 1.163964 | 27.538462 | 77.230769 | 773230769.230769 | 14114923076.923077 | 3.958462 | 3918734987.169230 | 16.098039 | -4.253404 | 13 |
| 3 | 72.783335 | 0.912232 | 2.015435 | 542.666667 | 34.000000 | -350866666.666667 | -5843677777.777778 | -14.735556 | 372500020.988889 | 53.574485 | -8.831054 | 9 |
| 4 | 72.768128 | 5.701175 | 1.359857 | 25.598456 | 52.216216 | -913081.081081 | 1537660934.362934 | 3.719247 | 436114647.527683 | 23.473934 | -3.374716 | 259 |
plt.figure(figsize=(15, 10))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(num_col):
    plt.subplot(4, 3, i + 1)
    sns.boxplot(data=df, x="K_means_segments", y=variable)
plt.tight_layout(pad=2.0)
df.groupby("K_means_segments").mean(numeric_only=True).plot.bar(figsize=(15, 6))
<Axes: xlabel='K_means_segments'>
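PCA is imported at the top of the notebook but never used; projecting the scaled features onto the first two principal components is a common way to visualize how separated the K-means segments are in two dimensions. A sketch on synthetic 11-feature data (the real call would use `subset_scaled_df` and `kmeans.labels_`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the scaled 11-feature matrix: two blobs in 11 dimensions.
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(40, 11)) for m in (-3, 3)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Project to 2 components; the explained variance ratio shows how much
# of the structure the 2-D view retains.
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X)
print(coords.shape, pca.explained_variance_ratio_.round(2))
# A scatter of coords colored by `labels` then visualizes the segments:
# plt.scatter(coords[:, 0], coords[:, 1], c=labels)
```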
#list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
#list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(subset_scaled_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.9259195530524588.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.792530720285.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180427.
Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.
#printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.
#list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
    Z = linkage(subset_scaled_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
    print("Cophenetic correlation for {} linking is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
Cophenetic correlation for single linking is 0.9232271494002922.
Cophenetic correlation for complete linking is 0.7873280186580672.
Cophenetic correlation for average linking is 0.9422540609560814.
Cophenetic correlation for centroid linking is 0.9314012446828154.
Cophenetic correlation for ward linking is 0.7101180299865353.
Cophenetic correlation for weighted linking is 0.8693784298129404.
#printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.
Observations
#list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
#lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
#to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
#we will enumerate through the list of linkage methods above
#for each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(subset_scaled_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
    compare.append([method, coph_corr])
Observations
df_cc=pd.DataFrame(compare, columns=compare_cols)
df_cc
| | Linkage | Cophenetic Coefficient |
|---|---|---|
| 0 | single | 0.923227 |
| 1 | complete | 0.787328 |
| 2 | average | 0.942254 |
| 3 | centroid | 0.931401 |
| 4 | ward | 0.710118 |
| 5 | weighted | 0.869378 |
#list of distance metrics
distance_metrics = ["cityblock", "euclidean"]
#list of linkage methods
linkage_methods = ["average", "single"]
#to create a subplot image
fig, axs = plt.subplots(
    len(distance_metrics) * len(linkage_methods), 1, figsize=(10, 30)
)
i = 0
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(subset_scaled_df, metric=dm, method=lm)
        dendrogram(Z, ax=axs[i])
        axs[i].set_title("Distance metric: {}\nLinkage: {}".format(dm.capitalize(), lm))
        coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
        axs[i].annotate(
            f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
            (0.80, 0.80),
            xycoords="axes fraction",
        )
        i += 1
Observations
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="ward")  # note: `affinity` was renamed `metric` in newer scikit-learn versions
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(affinity='euclidean', n_clusters=4)
#adding hierarchical cluster labels to the original and scaled dataframes
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
cluster_profile = df.groupby("HC_Clusters").mean(numeric_only=True)  # numeric_only skips the text columns
cluster_profile["count_in_each_segments"] = (
    df.groupby("HC_Clusters")["Net_Income"].count().values
)
cluster_profile.style.highlight_max(color="lightgreen",axis=0)
| | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_segments | count_in_each_segments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_Clusters | |||||||||||||
| 0 | 48.006208 | -11.263107 | 2.590247 | 196.551724 | 40.275862 | -495901724.137931 | -3597244655.172414 | -8.689655 | 486319827.294483 | 75.110924 | -2.162622 | 1.620690 | 29 |
| 1 | 326.198218 | 10.563242 | 1.642560 | 14.400000 | 309.466667 | 288850666.666667 | 864498533.333333 | 7.785333 | 544900261.301333 | 113.095334 | 19.142151 | 0.000000 | 15 |
| 2 | 42.848182 | 6.270446 | 1.123547 | 22.727273 | 71.454545 | 558636363.636364 | 14631272727.272728 | 3.410000 | 4242572567.290909 | 15.242169 | -4.924615 | 2.000000 | 11 |
| 3 | 72.760400 | 5.213307 | 1.427078 | 25.603509 | 60.392982 | 79951512.280702 | 1538594322.807018 | 3.655351 | 446472132.228456 | 24.722670 | -2.647194 | 3.701754 | 285 |
plt.figure(figsize=(15, 10))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(num_col):
    plt.subplot(4, 3, i + 1)
    sns.boxplot(data=df, x="HC_Clusters", y=variable)
plt.tight_layout(pad=2.0)
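A contingency table of the two label columns shows how many observations the K-means and hierarchical groupings have in common. A sketch with hypothetical label vectors standing in for `df["K_means_segments"]` and `df["HC_Clusters"]` (the real call would be `pd.crosstab(df["K_means_segments"], df["HC_Clusters"])`):

```python
import pandas as pd

# Hypothetical label vectors; in the notebook these would be the two
# cluster-label columns added to df.
km_labels = pd.Series([0, 0, 1, 1, 2, 2, 2, 0], name="K_means_segments")
hc_labels = pd.Series([0, 0, 1, 1, 1, 2, 2, 0], name="HC_Clusters")

# Each cell counts observations assigned to that (K-means, HC) cluster pair;
# large concentrated blocks mean the two algorithms broadly agree.
overlap = pd.crosstab(km_labels, hc_labels)
print(overlap)
```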
Cluster 0
Cluster 1
Cluster 2
Which clustering technique took less time for execution?
Which clustering technique gave you more distinct clusters, or are they the same?
How many observations are there in the similar clusters of both algorithms?
How many clusters are obtained as the appropriate number of clusters from both algorithms?
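For the execution-time question, wrapping each model's `fit` in `time.perf_counter` gives a rough comparison; absolute numbers vary with machine, data size, and parameters. A sketch on random data of the same shape as the scaled stock features:

```python
import time
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(340, 11))  # same shape as the scaled stock data

start = time.perf_counter()
KMeans(n_clusters=5, n_init=10, random_state=1).fit(X)
kmeans_time = time.perf_counter() - start

start = time.perf_counter()
AgglomerativeClustering(n_clusters=4).fit(X)  # ward linkage by default
hc_time = time.perf_counter() - start

print(f"KMeans: {kmeans_time:.4f}s, Agglomerative: {hc_time:.4f}s")
```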